sql,kvserver: stop gossiping the system config #76279
Conversation
303f582 force-pushed to ce08f27
ce08f27 force-pushed to dea4797
ccddacc force-pushed to 37e04e4
Reviewed 59 of 72 files at r1, 6 of 6 files at r2, 7 of 7 files at r3, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @arulajmani, @miretskiy, @nvanbenschoten, and @stevendanna)
pkg/clusterversion/cockroach_versions.go, line 274 at r3 (raw file):
```go
// DisableSystemConfigGossipTrigger is a follow-up to EnableSpanConfigStore
// to disable the data propagation mechanism it and the entire spanconfig
// infrastructure obviate.
```
obviates?
pkg/config/system.go, line 193 at r3 (raw file):
```go
func (s *SystemConfig) getIndexBound(key roachpb.Key) int {
	if s == nil {
		panic("here")
```
Looks like debugging holdover.
pkg/sql/plan.go, line 545 at r3 (raw file):
```go
func (p *planner) maybeSetSystemConfig(id descpb.ID) error {
	if !descpb.IsSystemConfigID(id) || p.execCfg.Settings.Version.IsActive(
		p.EvalContext().Ctx(), clusterversion.DisableSystemConfigGossipTrigger,
```
Nice.
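The snippet above gates the old gossip trigger on both the descriptor ID and the cluster version. A minimal, self-contained sketch of that guard pattern (every type and predicate here is a simplified stand-in for the real `descpb`/`clusterversion` code, not the actual CockroachDB API):

```go
package main

import "fmt"

// descID is a hypothetical stand-in for descpb.ID.
type descID uint32

// isSystemConfigID is a stub: assume descriptors with low IDs live in the
// system config span (the real predicate is descpb.IsSystemConfigID).
func isSystemConfigID(id descID) bool { return id < 50 }

// maybeSetSystemConfig mirrors the guard in the snippet above: skip the
// gossip trigger for descriptors outside the system config span, or once
// the DisableSystemConfigGossipTrigger version gate is active.
func maybeSetSystemConfig(id descID, gateActive bool) bool {
	if !isSystemConfigID(id) || gateActive {
		return false // no trigger needed
	}
	return true // mixed-version cluster: still set the trigger
}

func main() {
	fmt.Println(maybeSetSystemConfig(10, false)) // mixed-version: trigger still set
	fmt.Println(maybeSetSystemConfig(10, true))  // gate active: no trigger
}
```

The point of the `||` ordering is that once the gate is active on every node, the trigger (and the system config gossip it drives) is never set again.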
pkg/kv/kvserver/client_lease_test.go, line 109 at r3 (raw file):
```go
zcfg := zonepb.DefaultZoneConfig()
settings := cluster.MakeTestingClusterSettingsWithVersions(
	clusterversion.TestingBinaryMinSupportedVersion,
```
[pedant] Here and below: should these tests be running at DisableSystemConfigGossipTrigger - 1?
pkg/kv/kvserver/client_split_test.go, line 1021 at r3 (raw file):
```go
ctx := context.Background()
serv, _, _ := serverutils.StartServer(t, base.TestServerArgs{
	// This test was written with the SystemConfigSpan in mind.
```
There are a handful of tests that make use of DisableSpanConfigs. Some of them, like this one, were too specific to the system config span. Others were just flaky with span config asynchrony and are still being investigated by their home teams. What were you planning to do for the latter kind?
pkg/kv/kvserver/client_split_test.go, line 3384 at r3 (raw file):
```go
ctx := context.Background()
serv, sqlDB, _ := serverutils.StartServer(t, base.TestServerArgs{
	DisableSpanConfigs: true,
```
Do we still need this testing flag?
9a5e9d7 force-pushed to a3d9493
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner, @arulajmani, @irfansharif, @miretskiy, @nvanbenschoten, and @stevendanna)
pkg/clusterversion/cockroach_versions.go, line 274 at r3 (raw file):
Previously, irfansharif (irfan sharif) wrote…
obviates?
Done.
pkg/config/provider.go, line 28 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
This comment is now stale.
Done.
pkg/config/provider.go, line 38 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Consider keeping this comment.
Done.
Code quote:
// The system config will never be updated, so return a nil channel.
pkg/gossip/gossip.go, line 1151 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Is it worth opening an issue to track all of these follow-up tasks?
pkg/gossip/keys.go, line 73 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
In a cluster that only performs rolling upgrades and that adds information to gossip with no ttl, can we guarantee that the information won't stick around in the gossip network forever? If not, then we should probably mark this as a retired gossip key but keep it around in v22.2, instead of removing it entirely, as it would be incorrect to ever use the key again.
For now, I've adopted your suggestion, though it doesn't seem too wild to write a migration to just remove the information.
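The retired-key convention discussed here can be sketched as a simple deny-list check at the write path; the key name and function signature below are illustrative assumptions, not the real `gossip` package API:

```go
package main

import "fmt"

// retiredGossipKeys lists keys that were removed in some version but, in a
// cluster that only ever performs rolling upgrades, may still have values
// floating around the gossip network with no TTL; reusing such a key for a
// new purpose would be incorrect. "system-db" is an assumed name for the
// retired system-config key.
var retiredGossipKeys = map[string]bool{
	"system-db": true,
}

// addInfo rejects writes to retired keys so they can never be repurposed.
func addInfo(key string, _ []byte) error {
	if retiredGossipKeys[key] {
		return fmt.Errorf("gossip key %q is retired and must not be reused", key)
	}
	return nil
}

func main() {
	fmt.Println(addInfo("system-db", nil)) // rejected: retired key
	fmt.Println(addInfo("node:1", nil))    // accepted
}
```

A migration that scrubs the lingering values, as suggested above, would make the deny-list unnecessary, at the cost of writing and shipping the migration.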
pkg/migration/migrations/descriptor_utils.go, line 60 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
What allows us to remove this unconditionally?
We haven't added any system tables before the migration which enables all of the span config infrastructure. It's admittedly subtle.
pkg/server/node.go, line 1333 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Could you add a sentence about why we need to continue to support this?
Well, I said something, but it led me to think more deeply about it, and I'm not pleased with the current situation. https://cockroachlabs.slack.com/archives/C020NSBH63F/p1644965222776869
pkg/server/node.go, line 1337 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Could you remind me what the migration story is for multi-tenant clusters? By the time we roll out a v22.2 host cluster, all SQL pods will be running v22.1?
That's correct.
pkg/kv/kvserver/client_lease_test.go, line 109 at r3 (raw file):
Previously, irfansharif (irfan sharif) wrote…
[pedant] Here and below: should these test be running at DisableSystemConfigGossipTrigger - 1?
Done.
pkg/config/system.go, line 193 at r3 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Looks like debugging holdover.
Done.
TFTR! bors r+
Build failed (retrying...):
Sorry, you'll need to rebase because of #76598 bors r-
Canceled.
There was a pair of tests that didn't feel worth carrying along. A follow-up commit removes the cluster setting. Release note: None
This cluster setting is no longer useful. Release note: None
a3d9493 force-pushed to 0e45b5f
bors r+
Build succeeded:
Follow-on from cockroachdb#76279. Fixes cockroachdb#75864. This commit adds a mechanism to combine the system config data of the tenant with the data provided over the GossipSubscription from the system tenant. It then plumbs a version into the zone config methods. In the mixed version state, the tenant uses the existing override from the system tenant. After the span config infrastructure has been activated, the tenant uses the overrides they've set. This affects, realistically, just the GC job, and, to a lesser extent, the optimizer. Release note: None
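The mixed-version behavior described in this commit message can be sketched as a version-gated choice between the system tenant's pushed override and the tenant's own override. Everything below is a simplified illustration (hypothetical types and field names), not the real tenant connector code:

```go
package main

import "fmt"

// zoneConfig is a hypothetical stand-in for a zone config; gcTTL is the
// GC TTL in seconds, the field that realistically matters to the GC job.
type zoneConfig struct{ gcTTL int }

// chooseZoneConfig mirrors the described behavior: before the span config
// infrastructure is active, the tenant uses the override provided by the
// system tenant over the GossipSubscription; once activated, it uses the
// override it has set itself, falling back to the system one if unset.
func chooseZoneConfig(spanConfigsActive bool, systemOverride, tenantOverride *zoneConfig) *zoneConfig {
	if !spanConfigsActive || tenantOverride == nil {
		return systemOverride
	}
	return tenantOverride
}

func main() {
	sys := &zoneConfig{gcTTL: 90000}
	ten := &zoneConfig{gcTTL: 600}
	fmt.Println(chooseZoneConfig(false, sys, ten).gcTTL) // mixed-version: 90000
	fmt.Println(chooseZoneConfig(true, sys, ten).gcTTL)  // activated: 600
}
```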
76504: *: fix system config for tenant r=ajwerner a=ajwerner

Follow-on from #76279. Fixes #75864. This commit adds a mechanism to combine the system config data of the tenant with the data provided over the GossipSubscription from the system tenant. It then plumbs a version into the zone config methods. In the mixed version state, the tenant uses the existing override from the system tenant. After the span config infrastructure has been activated, the tenant uses the overrides they've set. This affects, realistically, just the GC job, and, to a lesser extent, the optimizer.

Release note: None

78595: Week -1: follow-up emails r=rail a=celiala

This PR adds follow-up emails for:
- the day before `PrepDate` (to be sent only if there are still blockers)
- the day of `PrepDate` (to be sent only if there are still blockers)
- the day after `PrepDate` (to be sent only if there are still blockers)

The `post-blockers` command should be used for all emails before and up to `PrepDate`:

```
# where `DAYS_BEFORE_PREP` is one of: many-days | day-before | day-of
SMTP_PASSWORD=$MY_SMTP_PWD SMTP_USER=$MY_SMTP_USER GITHUB_TOKEN=$MY_GITHUB_TOKEN \
  $(bazel info bazel-bin)/pkg/cmd/release/release_/release post-blockers \
  --release-series 21.1 \
  --smtp-user $MY_SMTP_USER --smtp-host smtp.gmail.com --smtp-port 587 \
  --to $MY_TO_EMAIL \
  --template-dir ./pkg/cmd/release/templates \
  --prep-date "Apr 1, 2022" --publish-date "May 1, 2022" \
  --days-before-prep-date $DAYS_BEFORE_PREP
```

The `cancel-release-date` command should be used for the day-after-`PrepDate` email:

```
SMTP_PASSWORD=$MY_SMTP_PWD SMTP_USER=$MY_SMTP_USER GITHUB_TOKEN=$MY_GITHUB_TOKEN \
  $(bazel info bazel-bin)/pkg/cmd/release/release_/release cancel-release-date \
  --release-series 21.1 \
  --smtp-user $MY_SMTP_USER --smtp-host smtp.gmail.com --smtp-port 587 \
  --to $MY_TO_EMAIL \
  --template-dir ./pkg/cmd/release/templates \
  --publish-date "May 1, 2022" --next-publish-date "Jun 1, 2022"
```

### TeamCity changes

This will be paired with the following TeamCity jobs:
* [Send release blockers - 78595](https://teamcity.cockroachdb.com/buildConfiguration/Internal_Cockroach_Release_Process_SendReleaseBlockers_78595?branch=78595&buildTypeTab=overview&mode=builds#all-projects)
* [Cancel release date - 78595](https://teamcity.cockroachdb.com/buildConfiguration/Internal_Cockroach_Release_Process_CancelReleaseDate_78595?branch=78595&mode=builds#all-projects)

![image](https://user-images.githubusercontent.com/3051672/160359980-93ac438f-d8e6-4f6c-a0e0-00fab55b93bc.png)

#### Manual tests

##### Test 1: post-blockers --days-before-prep-date many-days
<img width="300" src="https://user-images.githubusercontent.com/3051672/160353978-12e8d2e9-1900-471b-b785-f247d19f474a.png">

##### Test 2: post-blockers --days-before-prep-date day-before
<img width="300" src="https://user-images.githubusercontent.com/3051672/160353943-8514a0d4-87f4-4ab7-8d67-78bb42f427c2.png">

##### Test 3: post-blockers --days-before-prep-date day-of
<img width="300" src="https://user-images.githubusercontent.com/3051672/160353904-d18e6d68-594c-48dc-8308-6f0c3621fe91.png">

##### Test 4: cancel-release-date
<img width="300" src="https://user-images.githubusercontent.com/3051672/160353668-1e2886f5-c39f-4d54-884d-f967ba06a114.png">

79551: roachtest: use SQL comparison in TLP r=michae2 a=michae2

Assists: #77279

TLP compares results of two queries, one unpartitioned and one partitioned three ways by a predicate. We were comparing TLP results by printing values to strings. Unfortunately there are some values which are equal according to SQL comparison, but print differently. And TLP, being the nefarious chaos monkey that it is, tends to find these. For example:

```sql
0::FLOAT
-0::FLOAT
'hello' COLLATE "en-US-u-ks-level2"
'HELLO' COLLATE "en-US-u-ks-level2"
INTERVAL '1 day'
INTERVAL '24 hours'
```

When equal-yet-different values like these are used in MAX or MIN aggregations, or are the keys for GROUP BY bucketing, it is nondeterministic which value will be chosen. (This is also true in PostgreSQL.) And sometimes the two TLP queries choose differently. This is not indicative of a bug, yet TLP fails. So this patch teaches TLP to use SQL comparison instead of string comparison. To do so, we wrap both queries in a bigger query:

```sql
WITH unpart AS MATERIALIZED (
  <unpartitioned query>
), part AS MATERIALIZED (
  <partitioned query>
), undiff AS (
  TABLE unpart EXCEPT ALL TABLE part
), diff AS (
  TABLE part EXCEPT ALL TABLE unpart
)
SELECT (SELECT count(*) FROM undiff), (SELECT count(*) FROM diff)
```

Only if this query returns counts other than 0 (meaning the SQL comparison detected a real mismatch) do we diff the printed results.

Release note: None

79767: util/contextutil: clarify `RunWithTimeout` error message r=erikgrinaker a=erikgrinaker

`RunWithTimeout` gave an error message of this form:

```
operation "send-snapshot" timed out after 1m (took 1m): context deadline exceeded
```

However, this can be misleading because either the caller or callee can set their own context timeout that is less than the one set by `RunWithTimeout`, in which case it looks like:

```
operation "send-snapshot" timed out after 1m (took 1s): context deadline exceeded
```

This is even worse on old versions, where the "took 1s" is not included, leading the reader to believe the operation actually took 1m when it only took 1s. This patch clarifies the error message somewhat:

```
operation "send-snapshot" timed out after 1s (given timeout 1m): context deadline exceeded
```

Resolves #79424.

Release note: None

79787: protectedts: fix panic when fetching protectedts records r=adityamaru a=adityamaru

Previously, we incorrectly assumed that all records in the `system.protectedts_records` table would have a non-NULL target column. While this is enforced for all protected timestamp records written in a 22.1+ cluster, it is not true for records that might have been written prior to the 22.1 cluster version being finalized. This change makes the method responsible for reading protectedts records more defensive when encountering a NULL target column.

Fixes: #79684

Release note: None

79796: Publish cockroach-sql in publish bleeding edge r=rickystewart a=rail

Previously, the `cockroach-sql` tool was published only as a part of the release workflow. The first commit refactors the way we publish and test; this is very similar to what we use in publish-provisional-artifacts. The second commit adds the cockroach-sql binary to the list of artifacts to publish.

Release note: None

Co-authored-by: Andrew Werner <[email protected]>
Co-authored-by: Celia La <[email protected]>
Co-authored-by: Michael Erickson <[email protected]>
Co-authored-by: Erik Grinaker <[email protected]>
Co-authored-by: Aditya Maru <[email protected]>
Co-authored-by: Rail Aliiev <[email protected]>
There is still some work left to actually remove `SystemConfigSpan` or at least update `SystemConfigSpan` to contain just system.descriptor and system.zones. cockroachdb#76279 Release note: None
The first commit removes the system config gossip trigger and the client code to set it, after a version gate has been passed. The primary change is that in 22.1 binaries, we no longer rely on the gossip data at any point; instead we adopt the systemconfigwatcher, which is backed by a rangefeed. Part of this change is to adopt that separate data source in the GossipSubscription implementation of the Connector API. Subsequently, a number of tests needed updating.
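The rangefeed-backed watcher that replaces the gossip callbacks boils down to caching the latest config seen on an event stream. A rough, self-contained sketch of that shape (hypothetical types; the real systemconfigwatcher wraps a rangefeed over the config span and does considerably more):

```go
package main

import (
	"fmt"
	"sync"
)

// systemConfig is a hypothetical stand-in for config.SystemConfig.
type systemConfig struct{ numEntries int }

// watcher caches the most recent config it has observed. In the real
// systemconfigwatcher the updates arrive from a rangefeed rather than
// from gossip callbacks.
type watcher struct {
	mu      sync.Mutex
	current *systemConfig
}

// onRangefeedUpdate would be invoked by the rangefeed machinery whenever
// the watched span changes.
func (w *watcher) onRangefeedUpdate(cfg systemConfig) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.current = &cfg
}

// GetSystemConfig returns the last config seen, or nil before the
// initial scan has completed.
func (w *watcher) GetSystemConfig() *systemConfig {
	w.mu.Lock()
	defer w.mu.Unlock()
	return w.current
}

func main() {
	w := &watcher{}
	fmt.Println(w.GetSystemConfig() == nil) // true: nothing seen yet
	w.onRangefeedUpdate(systemConfig{numEntries: 42})
	fmt.Println(w.GetSystemConfig().numEntries)
}
```

Callers that previously registered gossip callbacks instead poll or subscribe to this cached view, which is what lets the GossipSubscription implementation of the Connector API swap in the new data source transparently.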
Remove sql.catalog.unsafe_skip_system_config_trigger.enabled
This cluster setting is no longer useful.
Fixes #54477.
Fixes #70560.
Release note (sql change): The limitation that schema change statements in a transaction could not follow
DML statements has been lifted.